<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
	<title>N-gram tokenizer | ElasticSearch 7.7 权威指南中文版</title>
	<meta name="keywords" content="ElasticSearch 权威指南中文版, elasticsearch 7, es7, 实时数据分析，实时数据检索" />
    <meta name="description" content="ElasticSearch 权威指南中文版, elasticsearch 7, es7, 实时数据分析，实时数据检索" />
    <!-- Give IE8 a fighting chance -->
    <!--[if lt IE 9]>
    <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
    <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
    <![endif]-->
	<link rel="stylesheet" type="text/css" href="../static/styles.css" />
	<script>
	var _link = 'analysis-ngram-tokenizer.html';
    </script>
</head>
<body>
<div class="main-container">
    <section id="content">
        <div class="content-wrapper">
            <section id="guide" lang="zh_cn">
                <div class="container">
                    <div class="row">
                        <div class="col-xs-12 col-sm-8 col-md-8 guide-section">
                            <div style="color:gray; word-break: break-all; font-size:12px;">原英文版地址: <a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.7/analysis-ngram-tokenizer.html" rel="nofollow" target="_blank">https://www.elastic.co/guide/en/elasticsearch/reference/7.7/analysis-ngram-tokenizer.html</a>, 原文档版权归 www.elastic.co 所有<br/>本地英文版地址: <a href="../en/analysis-ngram-tokenizer.html" rel="nofollow" target="_blank">../en/analysis-ngram-tokenizer.html</a></div>
                        <!-- start body -->
                  <div class="page_header">
<strong>重要</strong>: 此版本不会发布额外的bug修复或文档更新。最新信息请参考 <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html" rel="nofollow">当前版本文档</a>。
</div>
<div id="content">
<div class="breadcrumbs">
<span class="breadcrumb-link"><a href="index.html">Elasticsearch Guide [7.7]</a></span>
»
<span class="breadcrumb-link"><a href="analysis.html">Text analysis</a></span>
»
<span class="breadcrumb-link"><a href="analysis-tokenizers.html">Tokenizer reference</a></span>
»
<span class="breadcrumb-node">N-gram tokenizer</span>
</div>
<div class="navheader">
<span class="prev">
<a href="analysis-lowercase-tokenizer.html">« Lowercase Tokenizer</a>
</span>
<span class="next">
<a href="analysis-pathhierarchy-tokenizer.html">Path Hierarchy Tokenizer »</a>
</span>
</div>
<div class="section">
<div class="titlepage"><div><div>
<h2 class="title">
<a id="analysis-ngram-tokenizer"></a>N-gram tokenizer<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elastic/elasticsearch/edit/7.7/docs/reference/analysis/tokenizers/ngram-tokenizer.asciidoc">edit</a>
</h2>
</div></div></div>
<p>The <code class="literal">ngram</code> tokenizer first breaks text down into words whenever it encounters
one of a list of specified characters, then it emits
<a href="https://en.wikipedia.org/wiki/N-gram" class="ulink" target="_top">N-grams</a> of each word of the specified
length.</p>
<p>N-grams are like a sliding window that moves across the word - a continuous
sequence of characters of the specified length. They are useful for querying
languages that don’t use spaces or that have long compound words, like German.</p>
<h3>
<a id="_example_output_14"></a>Example output<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elastic/elasticsearch/edit/7.7/docs/reference/analysis/tokenizers/ngram-tokenizer.asciidoc">edit</a>
</h3>
<p>With the default settings, the <code class="literal">ngram</code> tokenizer treats the initial text as a
single token and produces N-grams with minimum length <code class="literal">1</code> and maximum length
<code class="literal">2</code>:</p>
<div class="pre_wrapper lang-console">
<pre class="programlisting prettyprint lang-console">POST _analyze
{
  "tokenizer": "ngram",
  "text": "Quick Fox"
}</pre>
</div>
<div class="console_widget" data-snippet="snippets/871.console"></div>
<p>The above sentence would produce the following terms:</p>
<div class="pre_wrapper lang-text">
<pre class="programlisting prettyprint lang-text">[ Q, Qu, u, ui, i, ic, c, ck, k, "k ", " ", " F", F, Fo, o, ox, x ]</pre>
</div>
<h3>
<a id="_configuration_15"></a>Configuration<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elastic/elasticsearch/edit/7.7/docs/reference/analysis/tokenizers/ngram-tokenizer.asciidoc">edit</a>
</h3>
<p>The <code class="literal">ngram</code> tokenizer accepts the following parameters:</p>
<div class="informaltable">
<table border="0" cellpadding="4px">
<colgroup>
<col>
<col>
</colgroup>
<tbody valign="top">
<tr>
<td valign="top">
<p>
<code class="literal">min_gram</code>
</p>
</td>
<td valign="top">
<p>
Minimum length of characters in a gram.  Defaults to <code class="literal">1</code>.
</p>
</td>
</tr>
<tr>
<td valign="top">
<p>
<code class="literal">max_gram</code>
</p>
</td>
<td valign="top">
<p>
Maximum length of characters in a gram.  Defaults to <code class="literal">2</code>.
</p>
</td>
</tr>
<tr>
<td valign="top">
<p>
<code class="literal">token_chars</code>
</p>
</td>
<td valign="top">
<p>
</p>
<p>
Character classes that should be included in a token.  Elasticsearch
will split on characters that don’t belong to the classes specified.
Defaults to <code class="literal">[]</code> (keep all characters).
</p>
<p>Character classes may be any of the following:</p>
<div class="ulist itemizedlist">
<ul class="itemizedlist">
<li class="listitem">
<code class="literal">letter</code> —      for example <code class="literal">a</code>, <code class="literal">b</code>, <code class="literal">ï</code> or <code class="literal">京</code>
</li>
<li class="listitem">
<code class="literal">digit</code> —       for example <code class="literal">3</code> or <code class="literal">7</code>
</li>
<li class="listitem">
<code class="literal">whitespace</code> —  for example <code class="literal">" "</code> or <code class="literal">"\n"</code>
</li>
<li class="listitem">
<code class="literal">punctuation</code> — for example <code class="literal">!</code> or <code class="literal">"</code>
</li>
<li class="listitem">
<code class="literal">symbol</code> —      for example <code class="literal">$</code> or <code class="literal">√</code>
</li>
<li class="listitem">
<code class="literal">custom</code> —      custom characters which need to be set using the
<code class="literal">custom_token_chars</code> setting.
</li>
</ul>
</div>

</td>
</tr>
<tr>
<td valign="top">
<p>
<code class="literal">custom_token_chars</code>
</p>
</td>
<td valign="top">
<p>
Custom characters that should be treated as part of a token. For example,
setting this to <code class="literal">+-_</code> will make the tokenizer treat the plus, minus and
underscore sign  as part of a token.
</p>
</td>
</tr>
</tbody>
</table>
</div>
<div class="tip admon">
<div class="icon"></div>
<div class="admon_content">
<p>It usually makes sense to set <code class="literal">min_gram</code> and <code class="literal">max_gram</code> to the same
value.  The smaller the length, the more documents will match but the lower
the quality of the matches.  The longer the length, the more specific the
matches.  A tri-gram (length <code class="literal">3</code>) is a good place to start.</p>
</div>
</div>
<p>The index level setting <code class="literal">index.max_ngram_diff</code> controls the maximum allowed
difference between <code class="literal">max_gram</code> and <code class="literal">min_gram</code>.</p>
<h3>
<a id="_example_configuration_8"></a>Example configuration<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elastic/elasticsearch/edit/7.7/docs/reference/analysis/tokenizers/ngram-tokenizer.asciidoc">edit</a>
</h3>
<p>In this example, we configure the <code class="literal">ngram</code> tokenizer to treat letters and
digits as tokens, and to produce tri-grams (grams of length <code class="literal">3</code>):</p>
<div class="pre_wrapper lang-console">
<pre class="programlisting prettyprint lang-console">PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 Quick Foxes."
}</pre>
</div>
<div class="console_widget" data-snippet="snippets/872.console"></div>
<p>The above example produces the following terms:</p>
<div class="pre_wrapper lang-text">
<pre class="programlisting prettyprint lang-text">[ Qui, uic, ick, Fox, oxe, xes ]</pre>
</div>
</div>
<div class="navfooter">
<span class="prev">
<a href="analysis-lowercase-tokenizer.html">« Lowercase Tokenizer</a>
</span>
<span class="next">
<a href="analysis-pathhierarchy-tokenizer.html">Path Hierarchy Tokenizer »</a>
</span>
</div>
</div>

                  <!-- end body -->
                        </div>
                        <div class="col-xs-12 col-sm-4 col-md-4" id="right_col">
                        
                        </div>
                    </div>
                </div>
            </section>
        </div>
    </section>
</div>
<script src="../static/cn.js"></script>
</body>
</html>